Executive Summary

SwiftKey is a text messaging application that predicts the next word as you type a message. When the user types "I went to the ", the application presents three options for what the next word might be, for example: gym, store, restaurant.

In this project, we use R text mining tools to build an application that predicts the next word entered into a Shiny application. We use data scraped from blogs, Twitter, and news sites to build a language model to perform prediction. We predict the next word by calculating the maximum likelihood estimate (MLE) based on the language model.

TL;DR

In a nutshell, here’s a summary of the data analysis performed in this report.

  1. Load the raw data.
  2. Extract a 1% subsample of the data.
  3. Preprocess the data to remove stopwords, convert to lowercase, and perform other cleaning tasks.
  4. Generate 2-grams, 3-grams and 4-grams.
  5. Build a language model in which an ngram (i.e. a phrase) is a path to a search node in a tree.
  6. Calculate the maximum likelihood estimate (MLE) using the children of the search node.

Complete Dataset

The Data Science Capstone Course provided text data from three sources: blogs, Twitter, and news. The table below shows the number of lines per data source.

  filename            line_count
1 en_US.blogs.txt        899,288
2 en_US.news.txt       1,010,242
3 en_US.twitter.txt    2,360,148
4 Total                4,269,678

Processing Unstructured Text for Analysis

We use a framework for text mining applications within R to transform the unstructured text into the structured document-by-term matrix format required for analysis. The first step is to "clean" the text with a series of text processing functions, such as removing stopwords and converting to lowercase.
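As an illustration, the cleaning steps can be sketched in base R. The function name `clean_text` and the short stopword list below are illustrative, not the actual pipeline, which uses an R text mining framework with equivalent transformations:

```r
# Minimal text-cleaning sketch in base R (illustrative only)
clean_text <- function(x, stopwords = c("the", "a", "an", "to", "of")) {
  x <- tolower(x)                      # convert to lowercase
  x <- gsub("[[:punct:]]+", " ", x)    # strip punctuation
  x <- gsub("[[:digit:]]+", " ", x)    # strip numbers
  words <- strsplit(x, "\\s+")[[1]]
  words <- words[words != "" & !(words %in% stopwords)]  # drop stopwords
  paste(words, collapse = " ")
}

clean_text("I went to the Gym, twice!")  # "i went gym twice"
```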

Second, we tokenize the text into ngrams. For our analysis, a gram is a whitespace-delimited sequence of characters corresponding to an English word. The N in ngram is the number of words treated as a single unit. For example: a 1-gram is new, a 2-gram is new york, a 3-gram is new york city, and a 4-gram is new york city police.
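The tokenization described above can be sketched in base R by sliding a window of n words across the token sequence. The helper `make_ngrams` is illustrative, not the tokenizer used in the analysis:

```r
# Tokenize a cleaned string into ngrams using base R
make_ngrams <- function(text, n) {
  words <- strsplit(text, "\\s+")[[1]]
  if (length(words) < n) return(character(0))
  # Slide a window of n words across the token sequence
  sapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "))
}

make_ngrams("new york city police", 2)
# c("new york", "york city", "city police")
```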

Finally, we build a document-by-term matrix by transforming the ngrams into a bag-of-words model. Each column in the document-term matrix corresponds to a unique ngram. Each row represents a document, and each cell holds the frequency with which that ngram appears in the document.
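A toy version of this transformation, using two tiny one-line documents and unigram counts (the document names and contents are made up for illustration):

```r
# Build a small document-by-term matrix from unigram counts
docs <- c(doc1 = "new york city", doc2 = "new york new york")
tokens <- lapply(docs, function(d) strsplit(d, "\\s+")[[1]])
terms <- sort(unique(unlist(tokens)))   # matrix columns: unique terms
# Count each term per document; factor() fixes a common column order
dtm <- t(sapply(tokens, function(tok) table(factor(tok, levels = terms))))
dtm
#      city new york
# doc1    1   1    1
# doc2    0   2    2
```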

Sampling the Dataset

The 4,269,678 lines from the complete dataset can be memory intensive for the text mining tools and slow the analysis. To speed things up, we subsample 1% of the complete dataset and then work with the subsampled data for exploration and modeling. The subsampling implementation is in Appendix 1.

source num_lines num_unique_words mean_word_freq median_word_freq
twitter 23602 8040 20 9
blogs 8993 12414 15 6
news 10103 12850 15 7

We see that the mean word frequency is nearly twice the median frequency for all data sources. The word frequency distribution has a long tail - as seen in the plot below. 937 of 22899 (4.1%) words cover 50% of all word instances in the dataset. The mean word frequency is heavily weighted by a few words that occur very frequently.
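A coverage figure like "937 of 22899 words cover 50% of all word instances" can be computed from a word-frequency vector as sketched below; the frequencies here are made up for illustration:

```r
# Given a named vector of word frequencies, find how many of the most
# frequent words cover at least `target` of all word instances
coverage_count <- function(freqs, target = 0.5) {
  sorted <- sort(freqs, decreasing = TRUE)
  which(cumsum(sorted) / sum(sorted) >= target)[1]
}

freqs <- c(the = 100, to = 60, a = 40, gym = 5, store = 3, llama = 1)
coverage_count(freqs)  # 2: "the" and "to" cover >= 50% of the 209 instances
```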

NGram Frequencies

Now we will look at the top bigrams, trigrams and 4-grams in the dataset.

Top Bigrams

The top bigrams include places like new york and common phrases like happy birthday and last night.

Top Trigrams

The top trigrams include places like new york city and common phrases like happy mothers day and rock n roll.

Top 4-grams

The top 4-grams do not appear to be commonly occurring phrases but instead reflect frequently occurring lines in the corpus. I will investigate a weighting technique like TF-IDF to account for frequently occurring lines.

Word Prediction using NGrams

We build a tree using the ngrams and compute the MLE using the Dirichlet-multinomial model. We use the data.tree package, which can build a tree from a data.frame. Now let's perform a search for "data".

Word Prediction for: ‘data’

Here are the maximum likelihood estimates. They show a roughly 6% likelihood that entry will be the next word: "data entry" has a frequency of 12 and "data" has a frequency of 198, so the maximum likelihood estimate is 12/198, or about 6.1%.
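The estimates in the search output below can be reproduced directly from the reported counts:

```r
# MLE for P(next word | "data") = count("data <word>") / count("data")
count_data <- 198                       # frequency of the unigram "data"
children  <- c(entry = 12, streams = 10, recovery = 8,
               dating = 7, personalize = 7)
mle <- children / count_data
round(mle["entry"], 4)  # 0.0606
```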

results <- perform_search(ngram_tree, c("data"))
print(results)
##                   12                   10                  
## recommended_words "entry"              "streams"           
## likelihood        "0.0606060606060606" "0.0505050505050505"
##                   8                    7                   
## recommended_words "recovery"           "dating"            
## likelihood        "0.0404040404040404" "0.0353535353535354"
##                   7                   
## recommended_words "personalize"       
## likelihood        "0.0353535353535354"

Word Prediction for: ‘data entry’

Then if we query for "data entry", we traverse the tree through the nodes "data" and then "entry", and we recommend the words "just" and "respond".

results <- perform_search(ngram_tree, c("data", "entry"))
print(results)
##                   6      6        
## recommended_words "just" "respond"
## likelihood        "0.5"  "0.5"

Next Steps

  • Build a model using more than a 1% sample.
  • Deploy the ngram tree to the server side of a Shiny application.

Appendix

Appendix 1: Subsampling Code

This code collects a 1% sample using a “coin flip” to decide which lines to choose.

# sample the datasci dir
sample_capstone_data <- function(fn, outfn, sample_len=0.01) {
  print(sprintf("Reading %s", fn))
  lines <- readLines(fn)
  set.seed(123)
  print(sprintf("Read %s. Length %s", fn, length(lines)))
  # Flip a biased coin for each line; keep the lines whose flip succeeds
  keep <- rbinom(length(lines), 1, sample_len) == 1
  lines_sample <- lines[keep]
  print(sprintf("Writing %s. Length %s", outfn, length(lines_sample)))
  # writeLines preserves the raw text; write.csv would add quoting and a header
  writeLines(lines_sample, outfn)
}

sample_capstone_data("./data/final/en_US/en_US.twitter.txt",
                     "./data/final/en_US/sample/en_US.twitter.txt")
sample_capstone_data("./data/final/en_US/en_US.blogs.txt",
                     "./data/final/en_US/sample/en_US.blogs.txt")
sample_capstone_data("./data/final/en_US/en_US.news.txt",
                     "./data/final/en_US/sample/en_US.news.txt")